Abstract



We develop a simple optimization procedure for assigning binary values to the amino acids. The 
binary values are determined by a maximization of the degree of pattern conservation in groups of 
closely related protein sequences. The maximization is carried out at fixed composition. For
compositions approximately corresponding to an equipartition of the residues, the optimal encoding is
found to be strongly correlated with hydrophobicity. The stability of the procedure is demonstrated. 
Our calculations are based upon sequences in the SWISS-PROT database. 
1 Introduction



An amino acid sequence is a message in a 20-letter code that determines the shape and function of
the protein. This message is degenerate; amino acids may be exchanged to a certain degree without
drastically affecting the functionality of the protein. For example, it has been demonstrated
that the function of $\lambda$ repressor is very tolerant to exchanges of core residues as long as the
pattern of hydrophobicity remains unchanged. The nature of the degeneracy can be probed by analyzing
the Dayhoff mutation matrix with respect to conservation of different physico-chemical properties.
Results of such studies convincingly show that hydrophobicity plays a central role in the formation
of protein structure.

As a first-order approach to the structure of proteins, it may therefore be tempting to adopt a simple
two-letter code in which the residues are classified as either hydrophobic or hydrophilic. A two-letter
code can contain much structural information, as shown by studies of simplified models such as the
lattice-based HP model and an off-lattice model. However, there are $2^{20}$ possible
ways to reduce a 20-letter code to a binary code, and it is not obvious that the most efficient
encoding is closely related to a specific physico-chemical property such as hydrophobicity.

In this note we present a simple optimization procedure for assigning binary values, $\sigma_i = \pm 1$, to
the amino acids. The quantity which is optimized is the degree of pattern conservation in groups
of closely related protein sequences. In order to avoid trivial solutions, the optimization is carried
out for a fixed number, $N_+$, of amino acids with $\sigma_i = +1$. Depending on $N_+$, the optimal code may
or may not have a simple physico-chemical interpretation. Here we shall focus on the results for
$N_+ = 10$, which turn out to be strongly correlated with hydrophobicity. Although not unexpected, this
finding underlines the importance of hydrophobicity, as the method is free from physico-chemical
inputs. Furthermore, it should be stressed that the method is global in the sense that all possible
encodings, for fixed $N_+$, are considered.

Our calculations are based on groups of protein sequences extracted from the SWISS-PROT database,
each group corresponding to a fixed length and a single protein but different species. Using a
simple measure of similarity within these groups, the degree of pattern conservation is maximized. 
Although the procedure ignores problems due to insertions and deletions of residues, it turns out 
to be fairly robust. The robustness was tested by separately analyzing different parts of the data 
set. In this way one can also study the stability of individual binary values, and assign refined, 
non-binary, values to the residues. 

Our method is somewhat related to the method of "optimal matching hydrophobicities" by Sweet
and Eisenberg. In this method hydrophobicity values are determined from mutation probabilities
by using an iterative procedure. Our method uses mutation frequencies rather than
probabilities, and has the advantage that there is no need to specify initial values.

Another optimization procedure for determining hydrophobicity values has been proposed by
Cornette et al. [10]. This method uses secondary-structure data rather than mutation data; sequence
segments that form $\alpha$-helices are analyzed using Fourier methods. The hydrophobicity scale is
obtained by maximizing the strength of the signal for the 3.6-residue period characteristic of the
$\alpha$-helix. A similar method has been used for detecting patterns in biologically related sequences.

The paper is organized as follows. In Sec. 2 we describe our method. In Sec. 3 we discuss the
stability of the method and the results obtained. We end with a brief summary in Sec. 4.

Figure 1: Illustration of the method for one of the sequence groups. The group consists of 16
sequences of pancreatic hormone with $N = 36$, which are shown in the upper box. Shown below are
two binary assignments and the resulting patterns.



2 Method 



2.1 Forming groups of sequences 



In our calculations we have used a set of sequence groups extracted from the SWISS-PROT database, 
release 31. Each group consisted of sequences with the same protein name and length $N$, but
different biological sources. All possible groups of this type were formed for N < 140. As an
example, one of the groups is shown in Fig. 1. In order to test the size dependence of the results,
the data set was divided into two parts corresponding to N < 100 (380 groups containing 1251 
sequences in total) and 100 < N < 140 (227 groups, 717 sequences), respectively. 

In forming these sequence groups we have ignored problems related to insertions and deletions of 
residues. This was done in order to keep the method as free as possible from physico-chemical 
inputs. It is possible that the method can be improved by incorporating some carefully chosen 
alignment technique. Also, the groups are small; typically, they contain three to four sequences. It 
is therefore important to check the stability of the procedure, which will be done in Sec. 3. 

In our analysis we have removed uncertain sequences by ignoring all entries in the database
containing the feature keys UNSURE (indicates uncertainties in the sequence), NON_TER (an extremity
of the sequence is not the terminal residue) and NON_CONS (two residues in the sequence are
not consecutive). Furthermore, we removed sequences containing the letters B (denoting Asp or
Asn), Z (Gln or Glu) and X (any amino acid).
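The selection rules of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a dict with the keys `name`, `sequence` and `feature_keys`) and the function names are hypothetical and do not correspond to any SWISS-PROT parser. The sketch also assumes that every record of a given protein name comes from a different biological source, and that a group needs at least two sequences.

```python
from collections import defaultdict

# Feature keys and ambiguity letters that disqualify an entry (from the text).
EXCLUDED_KEYS = {"UNSURE", "NON_TER", "NON_CONS"}
AMBIGUOUS_LETTERS = set("BZX")


def is_clean(record):
    """True if the entry carries no excluded feature key and no ambiguous letter."""
    if EXCLUDED_KEYS & set(record["feature_keys"]):
        return False
    return not (AMBIGUOUS_LETTERS & set(record["sequence"]))


def form_groups(records, length_bound=140):
    """Group clean sequences by (protein name, length N), keeping N < length_bound."""
    groups = defaultdict(list)
    for record in records:
        seq = record["sequence"]
        if len(seq) < length_bound and is_clean(record):
            groups[(record["name"], len(seq))].append(seq)
    # A single sequence carries no information on pattern conservation.
    return {key: seqs for key, seqs in groups.items() if len(seqs) >= 2}


# Hypothetical toy records, for illustration only.
records = [
    {"name": "PAHO", "sequence": "APLEP", "feature_keys": set()},
    {"name": "PAHO", "sequence": "APSEP", "feature_keys": set()},
    {"name": "PAHO", "sequence": "APXEP", "feature_keys": set()},       # contains X
    {"name": "PAHO", "sequence": "GPSQP", "feature_keys": {"UNSURE"}},  # uncertain
    {"name": "INS", "sequence": "GIVEQ", "feature_keys": set()},        # singleton
]
groups = form_groups(records)
```

Keying the groups on the pair (name, length) mirrors the decision to avoid any alignment step: sequences of the same protein but different length simply end up in different groups.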
2.2 Comparing binary assignments



A binary encoding $\sigma$ of the amino acids is a mapping from the original 20-letter alphabet to a
two-letter alphabet, which we shall write as $\xi \to \sigma(\xi) \in \{+1, -1\}$, where $\xi$ specifies
the amino-acid type. Using the sequence groups described above, we shall assign a number
$\Delta^\sigma$ to each encoding $\sigma$. The quantity $\Delta^\sigma$ provides a measure of pattern
conservation and will be optimized. To define $\Delta^\sigma$ we proceed in three steps.

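First we define the distance $D^\sigma(\xi^a, \xi^b)$ between two arbitrary amino-acid sequences
$\xi^a = (\xi^a_1, \ldots, \xi^a_N)$ and $\xi^b = (\xi^b_1, \ldots, \xi^b_N)$ of the same length $N$,

$$D^\sigma(\xi^a, \xi^b) = \sum_{i=1}^{N} \left( 1 - \delta_{\sigma(\xi^a_i)\,\sigma(\xi^b_i)} \right), \qquad \delta_{st} = \begin{cases} 1 & \text{if } s = t \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

$D^\sigma(\xi^a, \xi^b)$ is the usual Hamming distance between the binary strings
$(\sigma(\xi^a_1), \ldots, \sigma(\xi^a_N))$ and $(\sigma(\xi^b_1), \ldots, \sigma(\xi^b_N))$ and takes integer
values between 0 and $N$. Note that $D^\sigma(\xi^a, \xi^b)$ remains unchanged under a simultaneous
change of the signs of all $\sigma(\xi)$'s, i.e., $D^{\sigma}(\xi^a, \xi^b) = D^{\sigma'}(\xi^a, \xi^b)$ for all
$\xi^a$ and $\xi^b$ if $\sigma'(\xi) = -\sigma(\xi)$ for all $\xi$. This implies a twofold degeneracy in $\sigma$ space.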
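As an illustration, the distance of Eq. (1) can be sketched in Python as follows. The function names and the two toy sequences are ours, and the set of ten residues assigned +1 is merely an illustrative choice of encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_encoding(positive):
    """Return sigma as a dict: +1 for residues in `positive`, -1 otherwise."""
    return {aa: (1 if aa in positive else -1) for aa in AMINO_ACIDS}


def distance(sigma, seq_a, seq_b):
    """Hamming distance between the binary images of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(sigma[x] != sigma[y] for x, y in zip(seq_a, seq_b))


sigma = make_encoding(set("CFHILMPVWY"))          # illustrative encoding, N_+ = 10
flipped = {aa: -s for aa, s in sigma.items()}     # global sign change

seq_a, seq_b = "ACDLKW", "GCELRS"
print(distance(sigma, seq_a, seq_b))    # 1: only the last position (W vs. S) differs
print(distance(flipped, seq_a, seq_b))  # 1: unchanged under the global sign flip
```

Five of the six positions differ at the amino-acid level, but only one of them changes the binary pattern under this particular encoding.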

Next we define a measure of the fluctuations within one group of sequences. For a group labeled $k$
consisting of the $P_k$ sequences $\xi^1, \ldots, \xi^{P_k}$ (all having the same length) we put

$$D_k^\sigma = \frac{1}{P_k - 1} \sum_{1 \le a < b \le P_k} D^\sigma(\xi^a, \xi^b) \qquad (2)$$

where the normalization is chosen so as to make the scaling of $D_k^\sigma$ linear in $P_k$ (the sum has
$P_k(P_k - 1)/2$ terms). A low value signals a high degree of similarity between the binary strings.
The calculation of $D_k^\sigma$ is illustrated in Fig. 1 for two different $\sigma$; $D_k^\sigma$ is high to the
left and low to the right.

The quantity $D_k^\sigma$ will not have a unique minimum $\sigma = \sigma_{\min}$ unless the group labeled $k$
is large. Finally, we therefore take a set of $I$ different sequence groups and define the optimal
encoding $\sigma_{\min}$ through

$$\Delta^{\sigma_{\min}} = \min_{\sigma} \Delta^{\sigma} \qquad (3)$$

where

$$\Delta^\sigma = \sum_{k=1}^{I} D_k^\sigma \qquad (4)$$

It is not meaningful to minimize $\Delta^\sigma$ ($\ge 0$) over all possible $\sigma$, since $\Delta^\sigma$ vanishes
if $\sigma(\xi)$ is a constant. In order to avoid this trivial solution, we have performed minimizations
for different fixed numbers of positive $\sigma(\xi)$'s, $N_+$. The twofold degeneracy mentioned above
disappears when $N_+$ is held fixed and $N_+ \ne 10$. Furthermore, the symmetry implies that it is
sufficient to consider $N_+ \le 10$. The number of distinguishable encodings varies from 20 for
$N_+ = 1$ to 92378 for $N_+ = 10$.
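The three steps above, together with the search at fixed $N_+$, can be sketched as a brute-force enumeration. This is a minimal sketch, not the authors' implementation: the prefactor $1/(P_k - 1)$ in the group measure is an assumption, the function names are ours, and the two toy groups are invented for illustration. The first two printed numbers reproduce the counts of distinguishable encodings quoted in the text.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def group_measure(positive, group):
    """D_k for one group of at least two equal-length sequences.

    Sums the pairwise Hamming distances between the binary strings and divides
    by (P_k - 1); this prefactor is an assumption of the sketch.
    """
    bits = [[aa in positive for aa in seq] for seq in group]
    total = sum(
        sum(x != y for x, y in zip(bits[a], bits[b]))
        for a, b in combinations(range(len(bits)), 2)
    )
    return Fraction(total, len(group) - 1)


def delta(positive, groups):
    """Delta for the encoding that assigns +1 exactly to the residues in `positive`."""
    return sum(group_measure(positive, g) for g in groups)


def minimize(groups, n_plus, alphabet=AMINO_ACIDS):
    """Enumerate every encoding with n_plus residues at +1; return (minimum, minimizers)."""
    best, minimizers = None, []
    for positive in combinations(alphabet, n_plus):
        value = delta(set(positive), groups)
        if best is None or value < best:
            best, minimizers = value, [positive]
        elif value == best:
            minimizers.append(positive)
    return best, minimizers


print(comb(20, 1))        # 20 encodings for N_+ = 1
print(comb(20, 10) // 2)  # 92378 for N_+ = 10, after removing the twofold degeneracy

# Invented toy data on a reduced alphabet, to keep the enumeration tiny.
toy_groups = [["AILK", "AVLK", "AILR"], ["DEFW", "DEYW"]]
best, minimizers = minimize(toy_groups, n_plus=2, alphabet="ADEFIKLRVWY")
```

In the toy data the only substitutions are I/V, K/R and F/Y, so every encoding that keeps each of these pairs on the same side reaches the minimum. This also illustrates why small data sets do not single out a unique $\sigma_{\min}$.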



3 Results
N < 100: rows not recoverable from the source.

100 < N < 140:

N_+   A C D E F G H I K L M N P Q R S T V W Y
  1   - - - - - - - - - - - - - - - - - - + -
  2   - + - - - - - - - - - - - - - - - - + -
  3   - + - - - - + - - - - - - - - - - - + -
  4   - + - - - - + - - - - - - - - - - - + +
  5   - + - - + - + - - - - - - - - - - - + +
  6   - + - - + - + - - - + - - - - - - - + +
  7   - + - - + - + - - - + - - + - - - - + +
  8   - + - - + - - + - + + - - - - - - + + +
  9   - + - - + - + + - + + - - - - - - + + +
 10   - + - - + - + + - + + - + - - - - + + +

Table 1: $\sigma_{\min}$ for different values of $N_+$. The results were obtained by using groups of length
N < 100 (top) and 100 < N < 140 (bottom). For $N_+ = 10$ there are two symmetry-related copies
of $\sigma_{\min}$, and we have chosen the one with $\sigma(\mathrm{A}) = -1$.



3.1 Minimizing $\Delta^\sigma$

We have performed the minimization described in the previous section for N < 100 and 100 < N <
140 and for $N_+ = 1, \ldots, 10$. In this subsection we present the results and discuss the robustness of
the procedure. In the next subsection we discuss the interpretation of the results.

The minimizations were carried out in a straightforward way by enumerating all possible encodings,
which turned out to be feasible for all $N_+$. In Table 1 we show the encoding $\sigma_{\min}$ that minimizes
$\Delta^\sigma$ for different $N$ and $N_+$. The interpretation of $\sigma_{\min}$ depends on $N_+$, as will be
discussed below. Results for different $N_+$ are therefore not to be thought of as corresponding to
some fixed property such as hydrophobicity (the results for $N_+ = 10$ are strongly correlated with
hydrophobicity, as will be shown below).

From Table 1 it can be seen that the size dependence of the results is fairly weak, and that the
structure of $\sigma_{\min}$ varies slowly with $N_+$. These observations indicate a certain degree of stability.
In order to check the stability of the method in more detail, we divided the data set into blocks (see
Table 2) and performed a separate optimization for each block. In Table 3 we show the results of
these calculations for $N_+ = 10$. For each block the irrelevant overall sign of $\sigma_{\min}$ (see Sec. 2) was
chosen so as to minimize the Hamming distance to $\sigma_{\min}$ for the full data set. For the full data set the
overall sign was fixed by setting $\sigma_{\min}(\mathrm{A}) = -1$. The results shown in Table 3 clearly demonstrate
the stability of the method; for example, it can be seen that twelve of the amino acids have been
assigned the same binary value in twelve or more of the fourteen independent block calculations.
Note that the stability of the assignment is amino-acid dependent. This will be used below to define
a refined, non-binary, scale.

Block   First entry   Last entry   No. of sequences   No. of groups
B1      ACBP_BOVIN    ...          ...                51
B2      ...           ...          ...                ...
B3      ...           ...          ...                ...
B4      ...           ...          ...                ...
B5      ...           ...          ...                ...
B6      ...           ...          ...                ...
B7      ...           ...          ...                ...
B8      ...           ...          ...                47

C1      ACP_BRACM     CYC_KLULA    117                33
C2      CYC_BRAOL     H3_VOLCA     119                ...
C3      ...           ...          ...                ...
C4      ...           ...          ...                33
C5      ...           ...          ...                41
C6      VDBP_CAMVC    YV1_TYLCH    120                45

Table 2: The blocks used for N < 100 (top) and 100 < N < 140 (bottom).

        A C D E F G H I K L M N P Q R S T V W Y

Full    - + - - + - + + - + + - + - - - - + + +
B1      + - - - + + - + - + + - + - - + + + - -
B2      - - - - + - + + - + + - + - + - - + + +
B3      + - - - + - - + - + + - + - - - + + + +
B4      + - - - + + - + - + + - + - - + + + - -
B5      - + - - + - + + - + + - + - - - - + + +
B6      - + - - + - + + - + + - + - - - - + + +
B7      - + - - + - + + - + + - + - - - - + + +
B8      - + - - + - + - + - + + - + + - - - + +

Full    - + - - + - + + - + + - + - - - - + + +
C1      - + - - + - + + - + + - - - + - - + + +
C2      - + - - + - + + - + + - - + - - - + + +
C3      - + - - + - + + - + + - + - - - - + + +
C4      + - - - + - - + - + + - + - - + + + + -
C5      - + - - + - + + - + + - + - - - - + + +
C6      + + - - + - - + - + + - - - - - + + + +

Table 3: $\sigma_{\min}$ as obtained using the full data set and the different blocks in Table 2 ($N_+ = 10$). The
results in the upper part are for N < 100, whereas those in the lower part are for 100 < N < 140.

Another way to test the stability is to study the distribution of $\Delta^\sigma$. In Fig. 2 we show two
histograms of $\Delta^\sigma$ corresponding to N < 100 and 100 < N < 140, respectively, for $N_+ = 10$. These
histograms were obtained using the full data sets. Also using the full data set, we then computed
$\Delta^\sigma$ for each $\sigma_{\min}$ from the block analysis. The positions of these fourteen values are shown in
Fig. 2. They are all located in the low 0.1% tails of the histograms, which gives another
demonstration of the stability of the method.

Figure 2: Histograms of $\Delta^\sigma$ for $N_+ = 10$, obtained using N < 100 (left) and 100 < N < 140
(right). The lower figures are enlargements of the low-$\Delta^\sigma$ tails, in which we have indicated the
values of $\Delta^\sigma$ for the different minima $\sigma_{\min}$ from the block analysis. The minima $\sigma_{\min}$ for the
blocks B5, B6 and B7 coincide with the minimum obtained using the full data set. The same is true for
the blocks C3 and C5.
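The statement about the low tails can be read as an empirical rank: for a given encoding, one asks what fraction of all enumerated encodings has a $\Delta^\sigma$ that is not larger. A minimal sketch of such a check is given below; the function name is ours and the numbers are a synthetic stand-in, not the values behind the histograms.

```python
from bisect import bisect_right


def tail_fraction(all_values, value):
    """Fraction of the entries in `all_values` that are <= `value`."""
    ordered = sorted(all_values)
    return bisect_right(ordered, value) / len(ordered)


# Synthetic stand-in for the Delta values of all encodings (not real data).
values = list(range(1, 2001))
print(tail_fraction(values, 2))  # 0.001, i.e. on the edge of the low 0.1% tail
```

With the 92378 encodings at $N_+ = 10$, an encoding lies in the low 0.1% tail if its fraction is at most 0.001, i.e. if at most 92 encodings score as low or lower.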



3.2 Interpretation 



It is well-known that physico-chemical properties can be extracted from mutation probabilities, 
as, e.g., in the method of Sweet and Eisenberg Our approach is different since it is based on 
mutation frequencies rather than probabilities. Somewhat surprisingly, it turns out that physico- 
chemical information can be obtained in this way also. 

In our method the optimized quantity $\Delta^\sigma$ is defined as the total number of pattern-breaking bits,
without reference to the frequency of occurrence of the amino acids. In this way emphasis is put
on the overall degree of pattern conservation rather than the values assigned to individual amino
acids. Whether the individual amino-acid values still contain useful information is a priori unclear.
A necessary condition for that is, of course, that the corresponding degree of pattern conservation
is relatively high. If, on the other hand, the mutations are more random, then the assigned values
tend to reflect the frequency of occurrence of the amino acids; the procedure then tends to put rare
amino acids in the smallest group (corresponding to $\sigma = +1$ if $N_+ < 10$).

In Fig. 3 we show amino-acid frequency against $N_+$ for the results corresponding to 100 < N <
140. As can be seen from this figure, there is a strong correlation between frequency and binary
value for $N_+ \le 7$. This does not necessarily imply that the binary values solely reflect frequency;
for example, W and C, the amino acids with $\sigma = +1$ for $N_+ = 2$ (100 < N < 140), are not only
rare but have, in fact, been found to be the least mutable residues. However, it is clear that in
order to extract any physico-chemical information for $N_+ \le 7$ a more detailed analysis is required,
which is beyond the scope of the present paper.

Figure 3: Frequency of occurrence for the amino acids with $\sigma = +1$ plotted against $N_+$, using data
for 100 < N < 140.

For $N_+ \ge 8$ it is clear that the binary values contain non-trivial information since the correlation
between binary value and frequency is fairly weak. In what follows we shall discuss the results for
$N_+ = 10$ in some detail.

In order to take into account the fact that the stability of the binary value is amino-acid dependent,
we begin by forming the average $\bar\sigma(\xi)$ of the results from the block analysis,

$$\bar\sigma(\xi) = \frac{1}{k} \sum_{i=1}^{k} \sigma_{\min}^{(i)}(\xi) \qquad (5)$$

where $\sigma_{\min}^{(i)}$ denotes the result obtained using data block $i$. As the size dependence was found
to be weak, the average has been taken over all the blocks in Table 2 ($k = 14$).
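Eq. (5) can be evaluated directly from the block results. The sketch below uses the fourteen block rows of Table 3 as transcribed from the source, under the assumption that the twenty columns follow the alphabetical order of the one-letter codes (an assumption, chosen because it places A in the first column, where the full-data result has the value $-1$). The names are ours.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of Table 3

FULL = "-+--+ -++-+ +-+-- --+++"
BLOCK_ROWS = {
    "B1": "+---+ +-+-+ +-+-- +++--",
    "B2": "----+ -++-+ +-+-+ --+++",
    "B3": "+---+ --+-+ +-+-- -++++",
    "B4": "+---+ +-+-+ +-+-- +++--",
    "B5": FULL, "B6": FULL, "B7": FULL,
    "B8": "-+--+ -+-+- ++-++ ---++",
    "C1": "-+--+ -++-+ +---+ --+++",
    "C2": "-+--+ -++-+ +--+- --+++",
    "C3": FULL,
    "C4": "+---+ --+-+ +-+-- ++++-",
    "C5": FULL,
    "C6": "++--+ --+-+ +---- -++++",
}


def plus_counts(rows):
    """Number of blocks in which each amino acid is assigned +1."""
    compact = [row.replace(" ", "") for row in rows.values()]
    return {aa: sum(r[j] == "+" for r in compact) for j, aa in enumerate(AMINO_ACIDS)}


counts = plus_counts(BLOCK_ROWS)
k = len(BLOCK_ROWS)

# Eq. (5): average of the +/-1 values over the k blocks.
sigma_bar = {aa: (2 * n - k) / k for aa, n in counts.items()}

# Amino acids with the same binary value in at least twelve of the blocks.
stable = [aa for aa, n in counts.items() if max(n, k - n) >= 12]
print(len(stable), "".join(stable))
```

Under the stated column assumption, the residues with $|\bar\sigma(\xi)|$ close to 1 are those whose assignment is insensitive to the choice of block, while values near 0 flag residues that switch sides from block to block.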

In Fig. 4a we have plotted $\bar\sigma(\xi)$ against hydrophobicity, using the scale of Fauchère and Pliska [12].
From this figure it is clear that $\bar\sigma(\xi)$ is strongly correlated with hydrophobicity. The correlation
coefficient is 0.91, which, in fact, is a representative value for what Cornette et al. typically found
when comparing various existing hydrophobicity scales [10]. As an example, we show in Fig. 4b
the scale of Fauchère and Pliska against that of Roseman (correlation coefficient 0.93). The
correlation between $\bar\sigma(\xi)$ and hydrophobicity is much stronger than that between $\bar\sigma(\xi)$ and
frequency of occurrence (correlation coefficient -0.32).

The correlation between $\bar\sigma(\xi)$ and hydrophobicity becomes weaker as $N_+$ decreases, but remains
fairly strong down to $N_+ = 8$.

Figure 4: a) $\bar\sigma(\xi)$ against the hydrophobicity scale of Fauchère and Pliska [12], $F(\xi)$. b) $F(\xi)$
against the hydrophobicity scale of Roseman.



4 Summary 

We have developed a simple optimization procedure based on pattern conservation for assigning
binary values to the amino acids. In this method the optimal encoding is determined by a global
search at fixed composition, i.e., fixed value of the parameter $N_+$. The interpretation of the optimal
encoding depends on $N_+$. For $N_+ = 10$ we have shown that the results are strongly correlated
with hydrophobicity. Since the method is global and free from physico-chemical inputs, this finding
illustrates the importance of the hydrophobicity pattern.

The stability of the method was demonstrated by applying it to independent sets of protein
sequences. The method can probably be improved by incorporating some sequence alignment
technique. However, it is important that this is done in such a way that the amount of physico-chemical
input is kept at a minimum. The method can easily be generalized to non-binary scales.

As an example of a possible application of our method, let us mention the question of how the 
statistical distribution of amino acids along functional protein sequences differs from a random 
distribution. This question has recently been addressed using binary assignments based on
hydrophobicity. By studying long-range correlations, it was demonstrated that protein sequences
differ from random sequences in a statistically significant way. Alternative assignments such as 
those presented here may be useful in studying the nature of these deviations from randomness. 